Primary Models
Three models are fitted to the data, each with its own strengths, chosen so that each masks the weaknesses of the others. K-nearest neighbours with kernel smoothing captures local similarity, not only geospatially but throughout the entire feature space. Gradient boosted decision trees are used for their robustness and successful track record. Finally, a generalised additive model is fitted: splines are first fitted to the individual features, and a linear combination of these nonlinear functions is then taken.
Before fitting, it is useful for the tree and neighbourhood models to encode the factor variable type as a numeric. The levels are encoded in order of their mean price. (Strictly speaking, ordering levels by the target is not best practice, since it leaks a little information from the response into the features; with only three levels, though, the risk is small.)
encode_type <- function(df) {
  #map unit < townhouse < house, ordered by mean price
  #note: revalue renames the factor levels, so convert via character
  #to recover the new labels rather than the underlying level codes
  df$type_encoded <- revalue(df$type, replace = c('u' = '1', 't' = '2', 'h' = '3')) %>%
    as.character %>% as.numeric
  return(df)
}
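As a quick sanity check, the helper can be applied to a toy data frame (this example is illustrative and assumes the encode_type helper defined above, plus the plyr and dplyr packages for revalue and the pipe):

```r
library(plyr)
library(dplyr)

#toy frame with the three property types
toy <- data.frame(type = factor(c('h', 'u', 't', 'h')))
encode_type(toy)$type_encoded
#houses ('h') should map to 3, units ('u') to 1, townhouses ('t') to 2
```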
Missing values are not imputed. Imputation was explored, but the error across sample sizes did not improve enough to justify it.
Gradient Boosted Trees
The model fitting begins with gradient boosted decision trees, specifically the xgboost package, an R API for the popular software library.
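The boosting call itself is not shown in this excerpt; a minimal sketch of how xgboost is typically fitted follows. The feature matrix and log-price target mirror those prepared for KNN below, and the hyperparameter values are illustrative assumptions, not the tuned ones:

```r
library(xgboost)

#assemble the design matrix and log-price label
#(train0_x / train0_y as prepared in the KNN section below)
dtrain <- xgb.DMatrix(data = as.matrix(train0_x), label = train0_y)

#illustrative hyperparameters, not the tuned values
params <- list(objective = 'reg:squarederror',
               eta = 0.05,
               max_depth = 6)

xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 500)
```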
Polar Coordinates
Looking at the map above, it appears that property value may be better parameterised with polar coordinates. Let’s set this up now with a function:
#----compute distance from city and bearings------------------------------------
polarise <- function(lnglat_centre, df) {
  locs <- select(df, c(lng, lat))
  #distance (in metres) from the centre point to each property
  df$dist_cbd <- apply(locs, 1, function(loc) distm(loc, lnglat_centre))
  #compass bearing from the centre point to each property
  df$bearing <- apply(locs, 1, function(loc) bearing(lnglat_centre, loc))
  return(df)
}
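A usage sketch, assuming the helper above and the geosphere package (which supplies distm and bearing); the CBD coordinates here are approximate and assumed for illustration, not taken from the original analysis:

```r
library(geosphere)
library(dplyr)

#approximate Melbourne CBD coordinates (lng, lat) - assumed for illustration
cbd <- c(144.9631, -37.8136)

#append dist_cbd (metres) and bearing (degrees) columns
train0 <- polarise(cbd, train0)
```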
K-Nearest Neighbours
The Epanechnikov kernel, K(u) = (3/4)(1 − u²) for |u| ≤ 1, was chosen ahead of time; it is reported to be optimal in a mean-squared-error sense (per Wikipedia, citing a paper behind a paywall). One caveat on the cross-validation below: the features were selected before CV was run, so the reported error may be slightly optimistic. This is acknowledged, but it is not expected to materially change the results.
#---------------------------prepare data----------------------------------------
train0_knn <- encode_type(train0)
features <- c('building_area',
              'lng',
              'lat',
              'year_built',
              'type_encoded',
              'nrooms',
              'land_area')
train0_y <- log(train0_knn$price)
train0_x <- train0_knn %>% select(features)
#---------------Prepare for caret's optimisation--------------------------------
#caret's trainControl function demands training indices
#simply passing -folds won't work
train_folds <- lapply(folds, function(f) which(!(1:nrow(train0_knn) %in% f)))
#using folds created earlier, create an object to pass to caret
#for cross validation
trainingParams <- trainControl(index = train_folds,
                               indexOut = folds,
                               verbose = FALSE)
#create a dataframe of hyperparameter values to search
tunegrid <- data.frame(kmax = 1:30,
                       distance = rep(1, 30),
                       kernel = rep('epanechnikov', 30))
#-----------------------------Run Caret-----------------------------------------
#Find the optimal kmax hyperparameter
knn_model <- train(x = train0_x,
                   y = train0_y,
                   method = 'kknn',
                   metric = 'RMSE',
                   tuneGrid = tunegrid,
                   trControl = trainingParams)
#save the optimal hyperparameters
tuned <- knn_model$bestTune
#---------------Create out-of-fold (OOF) predictions----------------------------
oof_preds <- rep(NA, nrow(train0_knn))
#iterate over folds
for (i in 1:KFOLD) {
  fold <- folds[[i]]
  #prepare fold-specific data
  train0_x_fold <- train0_x[-fold, ]
  train0_y_fold <- train0_y[-fold]
  val0_x_fold <- train0_x[fold, ]
  val0_y_fold <- train0_y[fold]
  #refit on the training folds using the tuned hyperparameters
  model <- train(x = train0_x_fold,
                 y = train0_y_fold,
                 method = 'kknn',
                 metric = 'RMSE',
                 tuneGrid = tuned,
                 trControl = trainControl('none'))
  #compute predictions on the held-out fold
  preds <- predict(model$finalModel, newdata = val0_x_fold)
  #add to the OOF predictions
  oof_preds[fold] <- preds
}
knn_oof_error <- sqrt(mean((oof_preds - train0_y)^2))
#-------------------------------------------plot--------------------------------
res <- knn_model$results
res <- select(res,c(kmax,distance,RMSE,MAE))
ggplot(res,aes(x=kmax)) + geom_line(aes(y=RMSE),lwd=2,color='#0078D7')
Generalised Additive Model
Now an additive model is fitted using the gam package. This model was fitted last because it relies the most on an understanding of the data.
The model expression was constructed manually: first through exploration, then by tuning the splines' degrees of freedom. Simplicity was favoured over complexity, to avoid the overfitting that lengthy experimentation with the data can produce.
caret was not used this time; it has a habit of oversimplifying the fitting interface.
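A sketch of the kind of model expression described, fitted directly with the gam package. The particular features and degrees of freedom here are assumptions for illustration, not the tuned expression from the analysis:

```r
library(gam)

#spline terms on continuous features, the factor entering linearly;
#degrees of freedom are illustrative, not the tuned values
gam_model <- gam(log(price) ~ s(building_area, df = 4) +
                              s(year_built, df = 3) +
                              s(dist_cbd, df = 4) +
                              type,
                 data = train0)
summary(gam_model)
```

Fitting with gam() directly keeps full control over the formula, which is exactly the flexibility that caret's interface tends to hide.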